SoundCloud Analysis

Author

Hajara Muzammal

Introduction:

Music streaming platforms such as Spotify and SoundCloud have fundamentally reshaped how listeners discover, consume, and share music. Rather than relying on traditional radio play or physical media, modern listeners increasingly engage with curated playlists, algorithmic recommendations, and social sharing features to explore new content. As a result, playlist inclusion has become a key mechanism through which songs gain exposure and, potentially, popularity. Understanding the relationship between playlist presence and track popularity is therefore an important step toward explaining how success is generated on contemporary streaming platforms.

This project explores song popularity using a SoundCloud-linked dataset sourced from Hugging Face, which combines Spotify-style audio features with links to SoundCloud tracks. The dataset contains nearly 15,000 songs along with detailed musical characteristics such as danceability, energy, valence (mood), tempo, and popularity scores. By leveraging both playlist metadata and audio features, this analysis aims to shed light on how measurable song attributes relate to broader patterns of popularity and exposure.

The overarching research question guiding this analysis is: What factors influence the popularity of songs across major music streaming platforms? To address this broader question, the specific focus of this report is: Is track popularity associated with playlist inclusion? In other words, do more popular songs tend to appear in more playlists, and are certain musical characteristics more prevalent among popular tracks?

To answer these questions, this report combines exploratory data analysis, visualization, and basic statistical techniques to examine relationships between popularity, playlist appearances, and audio features. While the dataset has limitations, including incomplete playlist metadata and the absence of temporal information, it nonetheless provides a valuable opportunity to explore how song characteristics and playlist exposure interact within a modern streaming context.

Data Ingest

We use a publicly available dataset hosted on Hugging Face, which contains playlist metadata, song characteristics, and direct SoundCloud links.

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

url <- "https://huggingface.co/datasets/Zuru7/Spotify_Songs_with_SoundCloud_links/resolve/main/song_df_normalised.csv"
SONGS_raw <- read_csv(url, show_col_types = FALSE)

# Standardize names (works even if you re-run the doc)
SONGS <- SONGS_raw %>%
  rename(
    track          = any_of(c("track", "track_name")),
    artist         = any_of(c("artist", "track_artist")),
    album          = any_of(c("album", "track_album_name")),
    popularity     = any_of(c("popularity", "track_popularity")),
    playlist_genre = any_of(c("genre", "playlist_genre")),
    playlist_subgenre = any_of(c("subgenre", "playlist_subgenre")),
    soundcloud_link = any_of(c("soundcloud_link", "links"))
  ) %>%
  filter(!is.na(track), !is.na(artist), !is.na(popularity))
glimpse(SONGS)
Rows: 14,987
Columns: 23
$ track             <chr> "i feel alive", "poison", "baby it's cold outside (f…
$ artist            <chr> "steady rollin", "bell biv devoe", "ceelo green", "k…
$ lyrics            <chr> "the trees, are singing in the wind the sky blue, on…
$ album             <chr> "love & loss", "gold", "ceelo's magic moment", "kard…
$ popularity        <dbl> 28, 0, 41, 65, 70, 52, 36, 42, 1, 58, 69, 72, 74, 41…
$ playlist_name     <chr> "hard rock workout", "back in the day - r&b, new jac…
$ playlist_genre    <chr> "rock", "r&b", "r&b", "pop", "r&b", "r&b", "r&b", "e…
$ playlist_subgenre <chr> "hard rock", "new jack swing", "neo soul", "dance po…
$ danceability      <dbl> 0.2166860, 0.8447277, 0.3580533, 0.7462341, 0.440324…
$ energy            <dbl> 0.8779620, 0.6460897, 0.3674362, 0.8850809, 0.632868…
$ key               <dbl> 0.81818182, 0.54545455, 0.45454545, 0.81818182, 0.54…
$ loudness          <dbl> 0.7817377, 0.6813893, 0.7425419, 0.8813965, 0.730275…
$ mode              <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1…
$ speechiness       <dbl> 0.02434122, 0.21616793, 0.01306387, 0.02065654, 0.03…
$ acousticness      <dbl> 0.011792960, 0.004353434, 0.694556021, 0.037297028, …
$ instrumentalness  <dbl> 0.010205339, 0.007422998, 0.000000000, 0.000000000, …
$ liveness          <dbl> 0.34221195, 0.48613476, 0.05781237, 0.13038190, 0.08…
$ valence           <dbl> 0.4080748, 0.6565622, 0.4090849, 0.2424166, 0.308073…
$ tempo             <dbl> 0.5545093, 0.4227024, 0.4605076, 0.5250801, 0.625378…
$ language          <chr> "en", "en", "en", "en", "en", "en", "en", "en", "es"…
$ sentiment         <chr> "Positive", "Positive", "Positive", "Negative", "Pos…
$ song_artist       <chr> "i feel alive steady rollin", "poison bell biv devoe…
$ soundcloud_link   <chr> "http://soundcloud.com/xobak3r/purple-vision-ft-xoro…

The dataset used in this analysis was obtained from Hugging Face and contains Spotify song metadata linked to SoundCloud URLs. It is appropriate for this question because it combines popularity metrics, playlist information, and detailed audio features while also enabling direct reference to SoundCloud content. The dataset consists of 14,987 observations and 23 variables, including track-level metadata, playlist attributes, audio features, sentiment labels, and SoundCloud links, making it well-suited for exploratory analysis of music popularity.
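The glimpse output shows audio features such as tempo and key taking values between 0 and 1, and the file name ends in "normalised", which suggests the numeric columns were rescaled before publication. The exact procedure is not documented in this report. The following is a minimal sketch, assuming simple min-max scaling; the `min_max` helper and the raw values are hypothetical and only illustrate how such scaled values could arise.

```r
# Hypothetical helper: rescale a numeric vector to the 0-1 range (min-max scaling).
# This is an assumption about how the published columns were produced.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical raw tempos in beats per minute
tempo_bpm <- c(60, 90, 120, 150, 180)
tempo_scaled <- min_max(tempo_bpm)

# Musical key coded as pitch classes 0-11; after scaling, each value is k / 11
key_scaled <- min_max(0:11)

print(tempo_scaled)
print(round(key_scaled, 8))
```

Under this assumption, a key value such as 0.81818182 in the glimpse output would correspond to pitch class 9 (9 / 11), and a one-unit change in any scaled feature spans its full observed range.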

Data Cleaning

Next, we reduce the data to the columns most relevant to playlist behavior and popularity.

PLAYLIST_TABLE <- SONGS %>%
  transmute(
    playlist_name   = playlist_name,
    artist_name     = artist,
    track_name      = track,
    album_name      = album,
    popularity      = popularity,
    playlist_genre  = playlist_genre,
    playlist_subgenre = playlist_subgenre,
    soundcloud_link = soundcloud_link
  )

glimpse(PLAYLIST_TABLE)
Rows: 14,987
Columns: 8
$ playlist_name     <chr> "hard rock workout", "back in the day - r&b, new jac…
$ artist_name       <chr> "steady rollin", "bell biv devoe", "ceelo green", "k…
$ track_name        <chr> "i feel alive", "poison", "baby it's cold outside (f…
$ album_name        <chr> "love & loss", "gold", "ceelo's magic moment", "kard…
$ popularity        <dbl> 28, 0, 41, 65, 70, 52, 36, 42, 1, 58, 69, 72, 74, 41…
$ playlist_genre    <chr> "rock", "r&b", "r&b", "pop", "r&b", "r&b", "r&b", "e…
$ playlist_subgenre <chr> "hard rock", "new jack swing", "neo soul", "dance po…
$ soundcloud_link   <chr> "http://soundcloud.com/xobak3r/purple-vision-ft-xoro…

After initial ingestion, the dataset was cleaned to focus on variables most relevant to playlist behavior and song popularity. The cleaned table contains 14,987 observations and 8 key variables, including playlist name, artist name, track name, album name, popularity score, playlist genre and subgenre, and a corresponding SoundCloud link. Tracks with missing popularity values were removed to ensure consistency across analyses, allowing all visualizations and comparisons to rely on complete and comparable observations. This streamlined structure provides a balanced combination of playlist context, song metadata, and platform linkage, making it well-suited for exploratory analysis of how musical characteristics and playlist placement relate to popularity.
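Before relying on a filter for missing values, it can help to count how many rows it actually removes. The sketch below uses a small hypothetical data frame (`toy_raw`) rather than the real dataset, and base R instead of dplyr so that it runs without extra packages.

```r
# Hypothetical toy data with some missing values
toy_raw <- data.frame(
  track      = c("a", "b", NA, "d"),
  artist     = c("w", "x", "y", "z"),
  popularity = c(10, NA, 30, 40),
  stringsAsFactors = FALSE
)

# Columns that must be present for a row to be kept
required <- c("track", "artist", "popularity")

# Number of missing values in each required column
missing_per_column <- colSums(is.na(toy_raw[required]))

# Keep only rows that are complete in the required columns
toy_clean <- toy_raw[complete.cases(toy_raw[required]), ]

# How many rows the filter removed
rows_dropped <- nrow(toy_raw) - nrow(toy_clean)

print(missing_per_column)
print(rows_dropped)
```

Reporting the number of dropped rows alongside the cleaned row count makes it clear whether the filter changed the sample at all.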

Data Exploration

We define a “popular song” as one with a popularity score of 70 or higher.
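One way such a threshold could be applied is to flag each track and compare feature averages between the two groups. The following is a minimal sketch on hypothetical toy data; the `toy` data frame, the `pop_threshold` name, and the `is_popular` column are illustrative and not part of the original analysis.

```r
# Hypothetical toy data standing in for the real track table
toy <- data.frame(
  track        = c("a", "b", "c", "d", "e", "f"),
  popularity   = c(28, 70, 41, 85, 69, 72),
  danceability = c(0.22, 0.84, 0.36, 0.75, 0.44, 0.90)
)

# Same cut-off as in the text
pop_threshold <- 70

# Flag tracks at or above the threshold
toy$is_popular <- toy$popularity >= pop_threshold

# Share of tracks classified as popular
share_popular <- mean(toy$is_popular)

# Mean danceability within the popular and non-popular groups
group_means <- tapply(toy$danceability, toy$is_popular, mean)

print(share_popular)
print(group_means)
```

Because the comparison uses `>=`, a track with a score of exactly 70 counts as popular while one with 69 does not.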

ppop_threshold <- 70
ppop_threshold
[1] 70

track_counts <- PLAYLIST_TABLE %>%
  distinct(playlist_name, track_name, artist_name, popularity) %>%
  count(track_name, artist_name, popularity, name = "playlist_appearances")

glimpse(track_counts)
Rows: 14,987
Columns: 4
$ track_name           <chr> "$20 fine", "$ave dat money (feat. fetty wap & ri…
$ artist_name          <chr> "jimi hendrix", "lil dicky", "max frost", "queen"…
$ popularity           <dbl> 44, 69, 43, 60, 0, 39, 83, 75, 50, 48, 55, 68, 5,…
$ playlist_appearances <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

To better understand how song popularity relates to playlist exposure, I constructed a track-level summary table that aggregates playlist information across the dataset. Each row in this table represents a unique combination of track, artist, and popularity score, along with the number of playlists in which it appears. Notably, the summary table has the same 14,987 rows as the source table, which means every track appears in exactly one playlist regardless of its popularity score. Playlist inclusion in this dataset is therefore sparse, with no songs repeated across playlists. This limits what the aggregation can reveal, but it still serves as the foundation for the visual and statistical checks below on whether more popular songs receive greater playlist exposure.
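A quick way to check whether an appearance count carries any information is to tabulate its distribution. The sketch below reproduces the counting logic in base R on a small hypothetical playlist table (`toy_playlists`), in which one song does appear in two playlists.

```r
# Hypothetical toy playlist table; "song x" appears in two playlists,
# and the last row is an exact duplicate of the one before it
toy_playlists <- data.frame(
  playlist_name = c("p1", "p2", "p1", "p3", "p3"),
  track_name    = c("song x", "song x", "song y", "song z", "song z"),
  artist_name   = c("artist 1", "artist 1", "artist 2", "artist 3", "artist 3"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate playlist/track/artist rows
unique_rows <- unique(toy_playlists)

# Count the remaining rows per track-artist pair
track_key <- paste(unique_rows$track_name, unique_rows$artist_name, sep = " | ")
playlist_appearances <- table(track_key)

# Distribution of appearance counts: how many tracks appear once, twice, ...
appearance_distribution <- table(as.vector(playlist_appearances))

# TRUE only if the counts differ between tracks
has_variation <- length(unique(as.vector(playlist_appearances))) > 1

print(playlist_appearances)
print(appearance_distribution)
print(has_variation)
```

If the distribution has a single category, as when the summary table has exactly as many rows as the source table, the count cannot be related to any other variable.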

Popularity vs Playlist Appearances

ggplot(track_counts, aes(x = popularity, y = playlist_appearances)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Popularity vs Playlist Appearances",
    x = "Track Popularity",
    y = "Number of Playlist Appearances"
  ) +
  theme_minimal(base_size = 13)

This scatter plot examines the relationship between track popularity and the number of playlists in which a track appears. All tracks sit at a single playlist appearance regardless of popularity score, so even highly popular songs are not included in multiple playlists in this dataset, and the fitted line shows no upward trend. Because playlist appearances do not vary, the plot cannot show whether popularity and playlist inclusion are related. The pattern more likely reflects how the dataset was assembled, with each song listed once, than playlist curation strategies, genre specialization, or user preferences.
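When one variable takes the same value for every observation, a Pearson correlation cannot be computed, which is relevant to the NA result reported in the Statistical Analysis section. The following is a minimal sketch with hypothetical vectors showing the difference between a constant and a varying count.

```r
# Hypothetical popularity scores
popularity <- c(28, 0, 41, 65, 70)

# Case 1: every track appears in exactly one playlist
appearances_constant <- c(1, 1, 1, 1, 1)
sd_constant <- sd(appearances_constant)

# cor() returns NA (with a warning) when a standard deviation is zero,
# because the correlation formula divides by both standard deviations
r_constant <- suppressWarnings(cor(popularity, appearances_constant))

# Case 2: counts vary, so a correlation is defined
appearances_varied <- c(1, 1, 2, 1, 3)
r_varied <- cor(popularity, appearances_varied)

print(sd_constant)
print(r_constant)
print(r_varied)
```

An NA here signals that the data contain no variation to analyse, not that the two variables are unrelated.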

Most danceable songs

SONGS %>%
  arrange(desc(danceability)) %>%
  select(track, artist, danceability, popularity, soundcloud_link) %>%
  slice_head(n = 5)
# A tibble: 5 × 5
  track                           artist danceability popularity soundcloud_link
  <chr>                           <chr>         <dbl>      <dbl> <chr>          
1 ice ice baby                    vanil…        1             70 http://soundcl…
2 cha cha slide - original live … dj ca…        0.999         54 http://soundcl…
3 funky friday                    dave          0.995         72 http://soundcl…
4 bad bad bad (feat. lil baby)    young…        0.994         81 http://soundcl…
5 cinnamon girl - radio edit      [dunk…        0.994         47 http://soundcl…

The table highlights the most danceable tracks in the dataset, with danceability scores approaching the upper bound of the metric. Some of these tracks, such as “ice ice baby” and “bad bad bad”, also have high popularity scores, while others remain less popular despite their strong rhythmic characteristics. Five tracks cannot establish a pattern, but the mix is consistent with the idea that high danceability does not guarantee widespread success on its own.

Danceability vs Popularity

ggplot(SONGS, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Danceability vs Popularity",
    x = "Danceability",
    y = "Popularity"
  ) +
  theme_minimal(base_size = 13)

This plot visualizes the relationship between danceability and popularity across all tracks. A weak positive trend is visible, indicating that tracks with higher danceability tend to be slightly more popular on average. However, the substantial dispersion of points shows that popularity varies widely at all danceability levels. This implies that while danceability may enhance a track’s appeal, it is only one of many factors influencing popularity.

Tempo vs Popularity

ggplot(SONGS, aes(x = tempo, y = popularity)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Tempo vs Popularity",
    x = "Tempo",
    y = "Popularity"
  ) +
  theme_minimal(base_size = 13)

The relationship between tempo and popularity appears weak: the loess-smoothed trend line is relatively flat, with only mild curvature. Popularity remains fairly stable across a wide range of tempo values, with no clear tempo range dominating popular songs. This indicates that tempo alone does not play a major role in determining a track’s popularity.
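A flexible smoother is useful here because a linear correlation near zero does not by itself rule out a curved relationship. The sketch below uses simulated data (not the real tracks) in which popularity peaks at mid-range tempo; all names and values are hypothetical.

```r
set.seed(42)

# Simulated scaled tempo and a hump-shaped response with a little noise
tempo_sim <- seq(0, 1, length.out = 101)
popularity_sim <- 0.25 - (tempo_sim - 0.5)^2 + rnorm(101, sd = 0.01)

# The linear (Pearson) correlation is close to zero because the curve is symmetric
pearson_r <- cor(tempo_sim, popularity_sim)

# A quadratic fit captures the curvature that the straight line misses
quad_fit <- lm(popularity_sim ~ poly(tempo_sim, 2))
quad_r2 <- summary(quad_fit)$r.squared

print(pearson_r)
print(quad_r2)
```

In the real data the smoothed line stays close to flat, so this kind of hidden curvature does not appear to be present, but a check like this separates "no linear trend" from "no trend at all".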

Statistical Analysis

# Correlation: Popularity vs Playlist Appearances
cor.test(track_counts$popularity, track_counts$playlist_appearances)

    Pearson's product-moment correlation

data:  track_counts$popularity and track_counts$playlist_appearances
t = NA, df = 14985, p-value = NA
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 NA NA
sample estimates:
cor 
 NA

# Correlations with audio features
cor.test(SONGS$danceability, SONGS$popularity)

    Pearson's product-moment correlation

data:  SONGS$danceability and SONGS$popularity
t = 7.2175, df = 14985, p-value = 5.548e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.04288832 0.07479784
sample estimates:
       cor 
0.05885811

cor.test(SONGS$energy, SONGS$popularity)

    Pearson's product-moment correlation

data:  SONGS$energy and SONGS$popularity
t = -11.128, df = 14985, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.10638503 -0.07462696
sample estimates:
        cor 
-0.09052901

cor.test(SONGS$tempo, SONGS$popularity)

    Pearson's product-moment correlation

data:  SONGS$tempo and SONGS$popularity
t = 1.7274, df = 14985, p-value = 0.08411
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.001900565  0.030113487
sample estimates:
       cor 
0.01411008

cor.test(SONGS$valence, SONGS$popularity)

    Pearson's product-moment correlation

data:  SONGS$valence and SONGS$popularity
t = -0.73058, df = 14985, p-value = 0.465
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02197611  0.01004317
sample estimates:
         cor 
-0.005968001

# Linear regression
model <- lm(popularity ~ danceability + energy + tempo + valence, data = SONGS)
summary(model)

Call:
lm(formula = popularity ~ danceability + energy + tempo + valence, 
    data = SONGS)

Residuals:
    Min      1Q  Median      3Q     Max 
-53.807 -16.590   4.981  18.849  55.951 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   43.0421     1.2495  34.449  < 2e-16 ***
danceability   8.6653     1.2497   6.934 4.25e-12 ***
energy       -11.7020     1.1155 -10.491  < 2e-16 ***
tempo          6.0333     1.2975   4.650 3.35e-06 ***
valence       -0.7786     0.9470  -0.822    0.411    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.14 on 14982 degrees of freedom
Multiple R-squared:  0.01234,   Adjusted R-squared:  0.01207 
F-statistic: 46.78 on 4 and 14982 DF,  p-value: < 2.2e-16

The statistical analysis reinforces the exploratory findings: several audio features are statistically associated with track popularity, but their practical influence is limited. The first test, between popularity and playlist appearances, returns NA because playlist_appearances equals 1 for every track, so no correlation can be computed.

Pearson correlation tests indicate a small but statistically significant positive relationship between danceability and popularity (r = 0.059, p < 0.001), suggesting that more danceable tracks tend to be slightly more popular on average. Energy shows a small but statistically significant negative correlation with popularity (r = −0.091, p < 0.001), meaning that higher-energy tracks tend to be slightly less popular. Tempo exhibits a very weak relationship that is not statistically significant at the 5% level (r = 0.014, p = 0.084), and valence shows no meaningful association (r ≈ −0.006, p = 0.465).

A multiple linear regression model largely agrees: danceability has a positive, statistically significant coefficient, energy has a significant negative coefficient, and valence remains non-significant. Tempo is the exception: its coefficient is positive and significant (p < 0.001) once danceability, energy, and valence are held constant, even though its bivariate correlation is not. However, the model explains only about 1.2% of the total variation in popularity (R² = 0.012), so audio features alone account for very little of what drives popularity. With nearly 15,000 observations, even very small effects reach statistical significance, so these p-values should not be read as evidence of practical importance.

Taken together, these results suggest that although certain musical characteristics are statistically detectable predictors of popularity, most of the variation is left unexplained. Popularity on streaming platforms may instead be driven by external factors such as marketing, artist reputation, algorithmic promotion, and social dynamics, none of which are measured in this dataset.
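The gap between statistical significance and practical importance can be checked by hand from the reported correlations. The sketch below uses the r values and degrees of freedom from the output above; the variable names are illustrative. It squares each r to get the share of variance in popularity associated with that feature, and recomputes the danceability t statistic from the standard formula t = r * sqrt(df) / sqrt(1 - r^2).

```r
# Correlations with popularity, copied from the cor.test output above
r_values <- c(
  danceability = 0.05885811,
  energy       = -0.09052901,
  tempo        = 0.01411008,
  valence      = -0.005968001
)

# Percentage of variance in popularity shared with each feature (r squared)
shared_variance_pct <- 100 * r_values^2

# Degrees of freedom reported by cor.test (n - 2)
df <- 14985

# t statistic implied by each r at this sample size
t_values <- r_values * sqrt(df) / sqrt(1 - r_values^2)

print(round(shared_variance_pct, 2))
print(round(t_values, 3))
```

Each feature shares less than 1% of its variance with popularity, yet the large sample inflates the t statistic (about 7.2 for danceability), which is why the p-values are so small.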

Conclusion:

This analysis examined factors influencing song popularity across streaming platforms by focusing on whether playlist inclusion is associated with popularity. In this dataset the question could not be answered directly: every track appears in exactly one playlist, so playlist appearances do not vary and their correlation with popularity is undefined (NA). In contrast, both visualizations and Pearson correlations revealed that danceability has a small but statistically significant positive association with popularity, while energy showed a small negative association. A multiple regression reinforced these results, indicating that the four audio features examined explain only about 1.2% of the variation in popularity. Musical attributes alone are therefore insufficient to account for popularity dynamics, and assessing the role of playlist inclusion would require data in which tracks appear in varying numbers of playlists.